online_shoppers <- data.frame(read.csv("online_shoppers_intention.csv"))
str(online_shoppers)
## 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 0 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : chr "Feb" "Feb" "Feb" "Feb" ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : chr "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
dim(online_shoppers)
## [1] 12330 18
There are no NAs in the dataset, and no ommissions in the end.
which(is.na(online_shoppers))
## integer(0)
online_shoppers <- na.omit(online_shoppers)
str(online_shoppers)
## 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 0 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : chr "Feb" "Feb" "Feb" "Feb" ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : chr "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Some of the features are not in their correct data type. We will factorize some of the variables. Some of the factored features have lots of levels, which could be a problem when designing models with dummies and increase computing expense.
online_shoppers$OperatingSystems = factor(online_shoppers$OperatingSystems)
online_shoppers$Browser = factor(online_shoppers$Browser)
online_shoppers$Region = factor(online_shoppers$Region)
online_shoppers$TrafficType = factor(online_shoppers$TrafficType)
online_shoppers$VisitorType = factor(online_shoppers$VisitorType)
online_shoppers$Weekend = factor(online_shoppers$Weekend)
online_shoppers$Revenue = factor(online_shoppers$Revenue)
# online_shoppers$Month = factor(online_shoppers$Month)
str(online_shoppers)
## 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 0 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : chr "Feb" "Feb" "Feb" "Feb" ...
## $ OperatingSystems : Factor w/ 8 levels "1","2","3","4",..: 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : Factor w/ 13 levels "1","2","3","4",..: 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : Factor w/ 9 levels "1","2","3","4",..: 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Weekend : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 2 1 1 2 1 1 ...
## $ Revenue : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
Let’s do an initial mixed corrplot to see the relationship between the features and the possibility of generating revenue. We will only look at the numerical/ordinal factors without the factors.
library(corrplot)
online_shoppers_num = subset(online_shoppers, select = -c(Month, OperatingSystems, Browser, Region, TrafficType, VisitorType))
online_shoppers_num$Weekend = as.numeric(online_shoppers_num$Weekend)
online_shoppers_num$Revenue = as.numeric(online_shoppers_num$Revenue)
str(online_shoppers_num)
## 'data.frame': 12330 obs. of 12 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 0 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Weekend : num 1 1 1 1 2 1 1 2 1 1 ...
## $ Revenue : num 1 1 1 1 1 1 1 1 1 1 ...
shop_corr <- cor(online_shoppers_num)
colnames(shop_corr) <- c("Adm", "A_D", "Info", "I_D", "PR", "PR_D", "BR", "ER", "PV", "SDay", "WE", "Rev")
corrplot.mixed(shop_corr)
We observed some very obvious correlation between things like the number of administrative pages visited and the amount of time each visitor spent on administrative pages, which is expected just like informational/informational_duration and productedRelated/productRelated_Duration.
The highest correlation with our response is the PageValues, which had a .49 correlation with Revenue. Again, this higher correlation makes sense because the PageValues feature is an average value/metric of the page visited before the user completed a transaction. This is expected, as pagevalues We will do some more EDA on that.
Other than that, the Bounce and Exit Rates both have a negative correlation with the likelihood of generating revenue, while users that visited productRelated pages seems to more likely generate revenue over users that visited admin pages.
From the t-test, we reject the null, and bring strong evidence to support the alternative hypothesis that the PageValue has a strong effect on whether or not the shopper buys.
online_shopper_bought <- online_shoppers[which(online_shoppers$Revenue==TRUE),]
online_shopper_not <- online_shoppers[which(online_shoppers$Revenue==FALSE),]
t.test(online_shopper_bought$PageValues, online_shopper_not$PageValues, conf.level = .95)
##
## Welch Two Sample t-test
##
## data: online_shopper_bought$PageValues and online_shopper_not$PageValues
## t = 31, df = 1954, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 23.7 26.9
## sample estimates:
## mean of x mean of y
## 27.26 1.98
Basic Logistic Regression with PageValues and the three types of pages.
unloadPkg("caret")
attach(online_shoppers)
ShopLogit <- glm(Revenue ~ PageValues + ProductRelated + Informational + Administrative + ExitRates, data = online_shoppers, family = "binomial")
summary(ShopLogit)
##
## Call:
## glm(formula = Revenue ~ PageValues + ProductRelated + Informational +
## Administrative + ExitRates, family = "binomial", data = online_shoppers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.997 -0.473 -0.387 -0.199 3.409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.04e+00 6.62e-02 -30.79 <2e-16 ***
## PageValues 7.99e-02 2.32e-03 34.42 <2e-16 ***
## ProductRelated 5.16e-03 5.89e-04 8.77 <2e-16 ***
## Informational 3.52e-02 2.17e-02 1.62 0.10
## Administrative 5.55e-04 9.21e-03 0.06 0.95
## ExitRates -1.89e+01 1.62e+00 -11.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10624.8 on 12329 degrees of freedom
## Residual deviance: 7448.3 on 12324 degrees of freedom
## AIC: 7460
##
## Number of Fisher Scoring iterations: 6
#kabledply(confusionMatrix(actual=ShopLogit$y,predicted=ShopLogit$fitted.values), title = "Confusion Matrix")
confusionMatrix(actual=ShopLogit$y,predicted=ShopLogit$fitted.values)
## [,1] [,2]
## [1,] 10192 1205
## [2,] 230 703
prob = predict(ShopLogit, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.882
plot(h)
ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopLogit)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.299 (df=6)
pR2(ShopLogit)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -3724.131 -5312.398 3176.535 0.299 0.227 0.393
loadPkg("caret")
This model looks to be a good model already. The coefficients are all statistically significant aside from the Informational pages, and the AIC is 7675.8. The AUC is .8883, which is higher than 0.8, which is evidence that the predictors are doing well in classifying the dataset.
Let’s create testing and training data from Cross Validation to navigate around overfitting and possible variance in creating testing/training data
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
ShopModelCV <- train(Revenue ~., data = online_shoppers, method = "glm", family = "binomial", trControl = train_control)
summary(ShopModelCV)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.142 -0.463 -0.324 -0.154 3.099
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.85e+00 2.17e-01 -8.55 < 2e-16 ***
## Administrative 3.04e-03 1.11e-02 0.27 0.78386
## Administrative_Duration -9.97e-05 1.95e-04 -0.51 0.60815
## Informational 3.16e-02 2.72e-02 1.16 0.24551
## Informational_Duration 5.27e-05 2.24e-04 0.24 0.81402
## ProductRelated 1.89e-03 1.17e-03 1.61 0.10661
## ProductRelated_Duration 5.96e-05 2.73e-05 2.18 0.02931 *
## BounceRates -1.69e+00 3.22e+00 -0.52 0.59988
## ExitRates -1.53e+01 2.44e+00 -6.26 3.8e-10 ***
## PageValues 8.19e-02 2.44e-03 33.60 < 2e-16 ***
## SpecialDay -9.51e-02 2.37e-01 -0.40 0.68837
## MonthDec -7.11e-01 1.88e-01 -3.79 0.00015 ***
## MonthFeb -1.71e+00 6.43e-01 -2.65 0.00795 **
## MonthJul 1.01e-01 2.20e-01 0.46 0.64620
## MonthJune -3.37e-01 2.77e-01 -1.22 0.22410
## MonthMar -5.82e-01 1.86e-01 -3.13 0.00174 **
## MonthMay -5.33e-01 1.76e-01 -3.03 0.00249 **
## MonthNov 4.60e-01 1.68e-01 2.74 0.00611 **
## MonthOct -5.82e-02 2.05e-01 -0.28 0.77607
## MonthSep 1.58e-02 2.13e-01 0.07 0.94093
## OperatingSystems2 1.55e-01 1.65e-01 0.94 0.34606
## OperatingSystems3 -4.16e-02 1.77e-01 -0.23 0.81421
## OperatingSystems4 -3.62e-02 1.81e-01 -0.20 0.84123
## OperatingSystems5 3.56e-02 1.23e+00 0.03 0.97683
## OperatingSystems6 -1.03e+00 9.47e-01 -1.09 0.27615
## OperatingSystems7 1.17e+00 1.14e+00 1.02 0.30667
## OperatingSystems8 3.03e-01 6.41e-01 0.47 0.63677
## Browser2 -1.20e-01 1.65e-01 -0.72 0.47022
## Browser3 -1.00e+00 5.71e-01 -1.76 0.07839 .
## Browser4 -5.01e-02 2.09e-01 -0.24 0.81046
## Browser5 8.74e-02 2.25e-01 0.39 0.69703
## Browser6 -4.06e-01 3.34e-01 -1.21 0.22489
## Browser7 -9.36e-02 5.06e-01 -0.18 0.85340
## Browser8 1.87e-01 3.18e-01 0.59 0.55625
## Browser9 -1.21e+01 1.46e+03 -0.01 0.99339
## Browser10 1.34e-01 2.95e-01 0.45 0.65082
## Browser11 NA NA NA NA
## Browser12 1.79e+00 7.82e-01 2.30 0.02171 *
## Browser13 9.06e-02 9.89e-01 0.09 0.92700
## Region2 1.52e-01 1.12e-01 1.36 0.17377
## Region3 -9.74e-03 8.66e-02 -0.11 0.91050
## Region4 -5.19e-02 1.15e-01 -0.45 0.65106
## Region5 -2.33e-01 2.12e-01 -1.10 0.27021
## Region6 5.45e-02 1.34e-01 0.41 0.68362
## Region7 -4.42e-03 1.36e-01 -0.03 0.97397
## Region8 6.35e-02 1.75e-01 0.36 0.71727
## Region9 -2.68e-01 1.74e-01 -1.54 0.12349
## TrafficType2 1.84e-01 9.53e-02 1.93 0.05392 .
## TrafficType3 -2.31e-01 1.24e-01 -1.86 0.06315 .
## TrafficType4 4.72e-02 1.41e-01 0.34 0.73751
## TrafficType5 2.51e-01 2.15e-01 1.17 0.24270
## TrafficType6 -8.43e-02 1.99e-01 -0.42 0.67115
## TrafficType7 3.35e-01 4.77e-01 0.70 0.48263
## TrafficType8 6.13e-01 1.80e-01 3.41 0.00065 ***
## TrafficType9 6.28e-03 6.80e-01 0.01 0.99263
## TrafficType10 3.83e-01 1.67e-01 2.30 0.02156 *
## TrafficType11 4.14e-01 2.20e-01 1.88 0.05961 .
## TrafficType12 -1.20e+01 1.46e+03 -0.01 0.99342
## TrafficType13 -5.28e-01 2.01e-01 -2.63 0.00855 **
## TrafficType14 -5.75e-01 1.13e+00 -0.51 0.61078
## TrafficType15 -1.23e+01 2.19e+02 -0.06 0.95534
## TrafficType16 1.93e+00 1.24e+00 1.56 0.11907
## TrafficType17 -1.17e+01 1.46e+03 -0.01 0.99356
## TrafficType18 -1.25e+01 4.45e+02 -0.03 0.97764
## TrafficType19 -1.15e+00 1.40e+00 -0.82 0.41143
## TrafficType20 4.82e-01 2.72e-01 1.77 0.07614 .
## VisitorTypeOther -6.56e-01 7.59e-01 -0.86 0.38737
## VisitorTypeReturning_Visitor -2.11e-01 9.11e-02 -2.31 0.02083 *
## WeekendTRUE 8.86e-02 7.25e-02 1.22 0.22181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10624.8 on 12329 degrees of freedom
## Residual deviance: 7078.7 on 12262 degrees of freedom
## AIC: 7215
##
## Number of Fisher Scoring iterations: 14
unloadPkg("caret")
xkabledply(confusionMatrix(actual=ShopModelCV$finalModel$y,predicted=ShopModelCV$finalModel$fitted.values), title = "Confusion Matrix")
| 1 | 2 |
|---|---|
| 10176 | 1164 |
| 246 | 744 |
prob = predict(ShopModelCV$finalModel, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.9
plot(h)
ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopModelCV$finalModel)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.334 (df=68)
loadPkg("caret")
Let’s reduce the number of features in this model, particularly the less significant ones. Starting from the null model, let’s add the strongest predictors and move in both directions to achieve the best BIC.
full = ShopModelCV
null = glm(Revenue ~ 1, family = "binomial")
step(null, scope = list(lower=null, upper=full), direction = "both", criterion = "AIC")
## Start: AIC=10627
## Revenue ~ 1
##
## Df Deviance AIC
## + PageValues 1 7889 7893
## + ExitRates 1 9666 9670
## + BounceRates 1 10088 10092
## + Month 9 10244 10264
## + TrafficType 19 10235 10275
## + ProductRelated 1 10380 10384
## + ProductRelated_Duration 1 10391 10395
## + Administrative 1 10416 10420
## + VisitorType 2 10504 10510
## + SpecialDay 1 10518 10522
## + Informational 1 10531 10535
## + Administrative_Duration 1 10538 10542
## + OperatingSystems 7 10546 10562
## + Informational_Duration 1 10576 10580
## + Weekend 1 10614 10618
## + Browser 12 10595 10621
## <none> 10625 10627
## + Region 8 10615 10633
##
## Step: AIC=7893
## Revenue ~ PageValues
##
## Df Deviance AIC
## + Month 9 7533 7555
## + ExitRates 1 7556 7562
## + BounceRates 1 7673 7679
## + ProductRelated 1 7679 7685
## + ProductRelated_Duration 1 7702 7708
## + TrafficType 19 7710 7752
## + Administrative 1 7804 7810
## + Informational 1 7834 7840
## + SpecialDay 1 7845 7851
## + Administrative_Duration 1 7853 7859
## + VisitorType 2 7855 7863
## + Informational_Duration 1 7858 7864
## + OperatingSystems 7 7866 7884
## + Weekend 1 7879 7885
## <none> 7889 7893
## + Browser 12 7871 7899
## + Region 8 7884 7904
## - PageValues 1 10625 10627
##
## Step: AIC=7555
## Revenue ~ PageValues + Month
##
## Df Deviance AIC
## + ExitRates 1 7240 7264
## + BounceRates 1 7333 7357
## + ProductRelated 1 7410 7434
## + ProductRelated_Duration 1 7418 7442
## + TrafficType 19 7408 7468
## + Administrative 1 7477 7501
## + Informational 1 7492 7516
## + VisitorType 2 7503 7529
## + Administrative_Duration 1 7507 7531
## + Informational_Duration 1 7509 7533
## + OperatingSystems 7 7505 7541
## + Weekend 1 7527 7551
## + SpecialDay 1 7530 7554
## <none> 7533 7555
## + Browser 12 7516 7562
## + Region 8 7526 7564
## - Month 9 7889 7893
## - PageValues 1 10244 10264
##
## Step: AIC=7264
## Revenue ~ PageValues + Month + ExitRates
##
## Df Deviance AIC
## + ProductRelated_Duration 1 7186 7212
## + ProductRelated 1 7192 7218
## + TrafficType 19 7183 7245
## + Informational 1 7222 7248
## + Informational_Duration 1 7227 7253
## + Administrative 1 7231 7257
## + Administrative_Duration 1 7235 7261
## + OperatingSystems 7 7223 7261
## + VisitorType 2 7235 7263
## <none> 7240 7264
## + Weekend 1 7238 7264
## + BounceRates 1 7240 7266
## + SpecialDay 1 7240 7266
## + Browser 12 7226 7274
## + Region 8 7234 7274
## - ExitRates 1 7533 7555
## - Month 9 7556 7562
## - PageValues 1 9369 9391
##
## Step: AIC=7212
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration
##
## Df Deviance AIC
## + TrafficType 19 7122 7186
## + VisitorType 2 7172 7202
## + Informational 1 7184 7212
## + Weekend 1 7184 7212
## <none> 7186 7212
## + BounceRates 1 7185 7213
## + ProductRelated 1 7185 7213
## + Informational_Duration 1 7185 7213
## + OperatingSystems 7 7173 7213
## + Administrative 1 7186 7214
## + SpecialDay 1 7186 7214
## + Administrative_Duration 1 7186 7214
## + Browser 12 7170 7220
## + Region 8 7180 7222
## - ProductRelated_Duration 1 7240 7264
## - ExitRates 1 7418 7442
## - Month 9 7453 7461
## - PageValues 1 9331 9355
##
## Step: AIC=7186
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration +
## TrafficType
##
## Df Deviance AIC
## + VisitorType 2 7116 7184
## + ProductRelated 1 7119 7185
## <none> 7122 7186
## + Informational 1 7120 7186
## + Weekend 1 7120 7186
## + Informational_Duration 1 7121 7187
## + BounceRates 1 7121 7187
## + Administrative 1 7121 7187
## + Administrative_Duration 1 7121 7187
## + SpecialDay 1 7121 7187
## + OperatingSystems 7 7111 7189
## + Browser 12 7106 7194
## + Region 8 7115 7195
## - TrafficType 19 7186 7212
## - ProductRelated_Duration 1 7183 7245
## - ExitRates 1 7294 7356
## - Month 9 7355 7401
## - PageValues 1 9220 9282
##
## Step: AIC=7184
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration +
## TrafficType + VisitorType
##
## Df Deviance AIC
## + ProductRelated 1 7113 7183
## <none> 7116 7184
## + Informational 1 7114 7184
## + Weekend 1 7115 7185
## + Informational_Duration 1 7115 7185
## + BounceRates 1 7116 7186
## - VisitorType 2 7122 7186
## + Administrative 1 7116 7186
## + Administrative_Duration 1 7116 7186
## + SpecialDay 1 7116 7186
## + OperatingSystems 7 7106 7188
## + Browser 12 7101 7193
## + Region 8 7109 7193
## - TrafficType 19 7172 7202
## - ProductRelated_Duration 1 7182 7248
## - ExitRates 1 7273 7339
## - Month 9 7348 7398
## - PageValues 1 9209 9275
##
## Step: AIC=7183
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration +
## TrafficType + VisitorType + ProductRelated
##
## Df Deviance AIC
## <none> 7113 7183
## + Informational 1 7111 7183
## + Weekend 1 7112 7184
## + Informational_Duration 1 7112 7184
## - ProductRelated 1 7116 7184
## + BounceRates 1 7112 7184
## + SpecialDay 1 7113 7185
## + Administrative_Duration 1 7113 7185
## + Administrative 1 7113 7185
## - VisitorType 2 7119 7185
## + OperatingSystems 7 7103 7187
## - ProductRelated_Duration 1 7120 7188
## + Browser 12 7098 7192
## + Region 8 7106 7192
## - TrafficType 19 7170 7202
## - ExitRates 1 7258 7326
## - Month 9 7336 7388
## - PageValues 1 9209 9277
##
## Call: glm(formula = Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration +
## TrafficType + VisitorType + ProductRelated, family = "binomial")
##
## Coefficients:
## (Intercept) PageValues
## -1.80e+00 8.17e-02
## MonthDec MonthFeb
## -7.10e-01 -1.78e+00
## MonthJul MonthJune
## 1.07e-01 -3.41e-01
## MonthMar MonthMay
## -5.73e-01 -5.53e-01
## MonthNov MonthOct
## 4.45e-01 -6.32e-02
## MonthSep ExitRates
## -8.51e-03 -1.63e+01
## ProductRelated_Duration TrafficType2
## 6.57e-05 1.83e-01
## TrafficType3 TrafficType4
## -2.51e-01 4.32e-02
## TrafficType5 TrafficType6
## 2.52e-01 -8.73e-02
## TrafficType7 TrafficType8
## 3.48e-01 5.83e-01
## TrafficType9 TrafficType10
## -3.11e-02 3.50e-01
## TrafficType11 TrafficType12
## 4.07e-01 -1.19e+01
## TrafficType13 TrafficType14
## -6.02e-01 -4.32e-01
## TrafficType15 TrafficType16
## -1.23e+01 1.95e+00
## TrafficType17 TrafficType18
## -1.18e+01 -1.25e+01
## TrafficType19 TrafficType20
## -1.14e+00 4.58e-01
## VisitorTypeOther VisitorTypeReturning_Visitor
## -5.83e-01 -2.14e-01
## ProductRelated
## 2.04e-03
##
## Degrees of Freedom: 12329 Total (i.e. Null); 12295 Residual
## Null Deviance: 10600
## Residual Deviance: 7110 AIC: 7180
Now we have our full logistic model, which is glm(formula = Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + TrafficType + VisitorType + ProductRelated, family = “binomial”)
Let’s run it through the same cross validation training/testing data.
ShopModelCV <- train(Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + TrafficType + VisitorType + ProductRelated, data = online_shoppers, method = "glm", family = "binomial", trControl = train_control)
summary(ShopModelCV)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.146 -0.464 -0.328 -0.155 3.126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.80e+00 1.94e-01 -9.27 < 2e-16 ***
## PageValues 8.17e-02 2.40e-03 34.04 < 2e-16 ***
## MonthDec -7.10e-01 1.86e-01 -3.81 0.00014 ***
## MonthFeb -1.78e+00 6.44e-01 -2.76 0.00580 **
## MonthJul 1.07e-01 2.19e-01 0.49 0.62534
## MonthJune -3.41e-01 2.77e-01 -1.23 0.21798
## MonthMar -5.73e-01 1.85e-01 -3.10 0.00194 **
## MonthMay -5.53e-01 1.71e-01 -3.23 0.00123 **
## MonthNov 4.45e-01 1.67e-01 2.67 0.00758 **
## MonthOct -6.32e-02 2.04e-01 -0.31 0.75640
## MonthSep -8.51e-03 2.13e-01 -0.04 0.96809
## ExitRates -1.63e+01 1.67e+00 -9.77 < 2e-16 ***
## ProductRelated_Duration 6.57e-05 2.61e-05 2.52 0.01182 *
## TrafficType2 1.83e-01 9.39e-02 1.94 0.05195 .
## TrafficType3 -2.51e-01 1.23e-01 -2.04 0.04129 *
## TrafficType4 4.32e-02 1.39e-01 0.31 0.75672
## TrafficType5 2.52e-01 2.13e-01 1.18 0.23659
## TrafficType6 -8.73e-02 1.97e-01 -0.44 0.65807
## TrafficType7 3.48e-01 4.72e-01 0.74 0.46088
## TrafficType8 5.83e-01 1.77e-01 3.29 0.00101 **
## TrafficType9 -3.11e-02 6.71e-01 -0.05 0.96303
## TrafficType10 3.50e-01 1.65e-01 2.12 0.03402 *
## TrafficType11 4.07e-01 2.10e-01 1.94 0.05230 .
## TrafficType12 -1.19e+01 1.46e+03 -0.01 0.99348
## TrafficType13 -6.02e-01 1.98e-01 -3.04 0.00235 **
## TrafficType14 -4.32e-01 1.07e+00 -0.40 0.68668
## TrafficType15 -1.23e+01 2.21e+02 -0.06 0.95539
## TrafficType16 1.95e+00 1.24e+00 1.58 0.11520
## TrafficType17 -1.18e+01 1.46e+03 -0.01 0.99354
## TrafficType18 -1.25e+01 4.45e+02 -0.03 0.97758
## TrafficType19 -1.14e+00 1.41e+00 -0.81 0.41803
## TrafficType20 4.58e-01 2.61e-01 1.76 0.07901 .
## VisitorTypeOther -5.83e-01 5.43e-01 -1.07 0.28311
## VisitorTypeReturning_Visitor -2.14e-01 9.03e-02 -2.37 0.01789 *
## ProductRelated 2.04e-03 1.12e-03 1.83 0.06794 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10624.8 on 12329 degrees of freedom
## Residual deviance: 7112.9 on 12295 degrees of freedom
## AIC: 7183
##
## Number of Fisher Scoring iterations: 14
The new AIC of the model is 7182.9, which is a noticeable improvement on the original full model of 7214.7. Let’s apply it to the test data (4:1 on random seed 123) and see how it does.
unloadPkg("caret")
xkabledply(confusionMatrix(actual=ShopModelCV$finalModel$y,predicted=ShopModelCV$finalModel$fitted.values), title = "Confusion Matrix")
| 1 | 2 |
|---|---|
| 10177 | 1168 |
| 245 | 740 |
prob = predict(ShopModelCV$finalModel, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.899
plot(h)
ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopModelCV$finalModel)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.331 (df=35)
Finally, let’s move the cutoff value to get a high sensitivity without sacrificing too much accuracy. Let’s try lowering it to .2.
xkabledply(confusionMatrix(actual=ShopModelCV$finalModel$y,predicted=ShopModelCV$finalModel$fitted.values, cutoff=.2), title = "Confusion Matrix")
| 1 | 2 |
|---|---|
| 9331 | 561 |
| 1091 | 1347 |
prob = predict(ShopModelCV$finalModel, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.899
plot(h)
ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopModelCV$finalModel)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.331 (df=35)
All of the models had generally high accuracies and AUCs, but I would choose the final model as it has the lowest AIC out of the three models. While the McFadden and AUC of the full model was slightly higher than that of the last model, as the complexity is much less with only 7 out of the possible 17 different features.
#First, load the file
df_orig = data.frame(read.csv("online_shoppers_intention.csv"))
head(df_orig)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 1 0.00 0.2000 0.2000 0
## 2 2 64.00 0.0000 0.1000 0
## 3 1 0.00 0.2000 0.2000 0
## 4 2 2.67 0.0500 0.1400 0
## 5 10 627.50 0.0200 0.0500 0
## 6 19 154.22 0.0158 0.0246 0
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1 0 Feb 1 1 1 1
## 2 0 Feb 2 2 1 2
## 3 0 Feb 4 1 9 3
## 4 0 Feb 3 2 2 4
## 5 0 Feb 3 3 1 4
## 6 0 Feb 2 2 1 3
## VisitorType Weekend Revenue
## 1 Returning_Visitor FALSE FALSE
## 2 Returning_Visitor FALSE FALSE
## 3 Returning_Visitor FALSE FALSE
## 4 Returning_Visitor FALSE FALSE
## 5 Returning_Visitor TRUE FALSE
## 6 Returning_Visitor FALSE FALSE
colSums(is.na(df_orig))
## Administrative Administrative_Duration Informational
## 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration
## 0 0 0
## BounceRates ExitRates PageValues
## 0 0 0
## SpecialDay Month OperatingSystems
## 0 0 0
## Browser Region TrafficType
## 0 0 0
## VisitorType Weekend Revenue
## 0 0 0
No null values in the dataset.
table(df_orig$Revenue)
##
## FALSE TRUE
## 10422 1908
We can see there is an imbalance in dataset. True values are less than 20% as compares to False.
df = data.frame(df_orig)
df$VisitorType = factor(df$VisitorType)
df$Weekend = factor(df$Weekend)
df$Revenue = factor(df$Revenue)
df$Month = factor(df$Month)
df[, c('VisitorType', 'Weekend', 'Revenue', 'Month')] <- sapply(df[, c('VisitorType', 'Weekend', 'Revenue', 'Month')], unclass)
df$VisitorType = factor(df$VisitorType)
df$Weekend = factor(df$Weekend)
df$Revenue = factor(df$Revenue)
df$Month = factor(df$Month)
head(df)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 1 0.00 0.2000 0.2000 0
## 2 2 64.00 0.0000 0.1000 0
## 3 1 0.00 0.2000 0.2000 0
## 4 2 2.67 0.0500 0.1400 0
## 5 10 627.50 0.0200 0.0500 0
## 6 19 154.22 0.0158 0.0246 0
## SpecialDay Month OperatingSystems Browser Region TrafficType VisitorType
## 1 0 3 1 1 1 1 3
## 2 0 3 2 2 1 2 3
## 3 0 3 4 1 9 3 3
## 4 0 3 3 2 2 4 3
## 5 0 3 3 3 1 4 3
## 6 0 3 2 2 1 3 3
## Weekend Revenue
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 2 1
## 6 1 1
ggplot(df_orig, aes(x=OperatingSystems, color=Revenue, fill=Revenue)) +
# geom_histogram(bins = 8, fill="pink") +
# geom_freqpoly(bins=8, color="red") +
geom_histogram(binwidth = 0.5, alpha=0.6, position = 'identity') +
scale_color_viridis(discrete=TRUE) +
scale_fill_viridis(discrete=TRUE) +
scale_x_continuous(breaks=1:8) + scale_y_continuous(n.breaks=6) +
theme_ipsum()
ggplot(df_orig, aes(x=Browser, color=Revenue, fill=Revenue)) +
geom_histogram(binwidth = 0.5, alpha=0.6, position = 'identity') +
scale_x_continuous(breaks=1:13, ) + scale_y_continuous(n.breaks=6) +
theme_ipsum()
ggplot(df_orig, aes(x=TrafficType, color=Revenue, fill=Revenue)) +
geom_histogram(binwidth = 1, alpha=0.6) +
scale_fill_manual(values=c("orange", "blue")) +
scale_x_continuous(n.breaks=20) + scale_y_continuous(n.breaks=6) +
theme_ipsum()
ggplot(df_orig, aes(x=VisitorType, color=Revenue, fill=Revenue)) +
geom_histogram(alpha=0.6, stat="count") +
scale_fill_manual(values=c("pink", "green")) + scale_color_manual(values=c("pink", "green")) +
theme_ipsum()
ggplot(df_orig, aes(x=Weekend, color=Revenue, fill=Revenue)) +
geom_histogram(alpha=0.6, stat="count") +
scale_fill_manual(values=c("cyan", "gray")) + scale_color_manual(values=c("cyan", "gray")) +
theme_ipsum()
ggplot(df_orig, aes(x=Region, color=Revenue, fill=Revenue)) +
geom_histogram(binwidth = 1, alpha=0.6, stat="count") + stat_bin(center=1) +
scale_x_continuous(breaks=1:13, ) + scale_y_continuous(n.breaks=6) +
theme_ipsum()
Plotting highlights the imbalance between positive and negative classes
in our response variable (Revenue). Also distribution is right
skewed.
We will perform Chi-Square test on all the last 6 categorical variables.
os_contable = table(df_orig$Revenue, df_orig$OperatingSystems)
os_contable
##
## 1 2 3 4 5 6 7 8
## FALSE 2206 5446 2287 393 5 17 6 62
## TRUE 379 1155 268 85 1 2 1 17
chisq.test(os_contable, simulate.p.value=T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: os_contable
## X-squared = 75, df = NA, p-value = 5e-04
br_contable = table(df_orig$Revenue, df_orig$Browser)
br_contable
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## FALSE 2097 6738 100 606 381 154 43 114 1 131 5 7 45
## TRUE 365 1223 5 130 86 20 6 21 0 32 1 3 16
chisq.test(br_contable, simulate.p.value=T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: br_contable
## X-squared = 28, df = NA, p-value = 0.008
rg_contable = table(df_orig$Revenue, df_orig$Region)
rg_contable
##
## 1 2 3 4 5 6 7 8 9
## FALSE 4009 948 2054 1007 266 693 642 378 425
## TRUE 771 188 349 175 52 112 119 56 86
chisq.test(rg_contable, simulate.p.value=F)
##
## Pearson's Chi-squared test
##
## data: rg_contable
## X-squared = 9, df = 8, p-value = 0.3
Tt_contable = table(df_orig$Revenue, df_orig$TrafficType)
Tt_contable
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## FALSE 2189 3066 1872 904 204 391 28 248 38 360 200 1 695 11
## TRUE 262 847 180 165 56 53 12 95 4 90 47 0 43 2
##
## 15 16 17 18 19 20
## FALSE 38 2 1 10 16 148
## TRUE 0 1 0 0 1 50
chisq.test(Tt_contable, simulate.p.value=T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: Tt_contable
## X-squared = 373, df = NA, p-value = 5e-04
Vt_contable = table(df_orig$Revenue, df_orig$VisitorType)
Vt_contable
##
## New_Visitor Other Returning_Visitor
## FALSE 1272 69 9081
## TRUE 422 16 1470
chisq.test(Vt_contable, simulate.p.value=F)
##
## Pearson's Chi-squared test
##
## data: Vt_contable
## X-squared = 135, df = 2, p-value <2e-16
wk_contable = table(df_orig$Revenue, df_orig$Weekend)
wk_contable
##
## FALSE TRUE
## FALSE 8053 2369
## TRUE 1409 499
chisq.test(wk_contable, simulate.p.value=F)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: wk_contable
## X-squared = 10, df = 1, p-value = 0.001
From above testing, all the variables except Region are statistically significant (p-value < 0.05). Region has p-value of 0.3 (>0.05) making it statistically insignificant. Hence we will omit the Region variable from our modeling.
set.seed(1234)
df_sample <- sample(2, nrow(df), replace=TRUE, prob=c(0.80, 0.20))
train <- df[df_sample==1, ]
# test_X <- df[df_sample==2, 1:17]
#
# train_Y <- df[df_sample==1, 18]
test <- df[df_sample==2, ]
We will choose “maxdepth” parameter for our decision tree. For this we will be using original unbalanced dataset only.
loadPkg("rpart")
loadPkg("caret")
confusionMatrixResultDf = data.frame( Depth=numeric(0), Accuracy= numeric(0), Sensitivity=numeric(0), Specificity=numeric(0), Pos.Pred.Value=numeric(0), Neg.Pred.Value=numeric(0), Precision=numeric(0), Recall=numeric(0), F1=numeric(0), Prevalence=numeric(0), Detection.Rate=numeric(0), Detection.Prevalence=numeric(0), Balanced.Accuracy=numeric(0), row.names = NULL )
for (deep in 2:7) {
tree_fit <- rpart(Revenue ~ . - Region, data=train, method="class", control = list(maxdepth = deep, cp=0.005))
#
cm = confusionMatrix(predict(tree_fit, test, type = "class"), reference = test$Revenue, positive = "2") # from caret library
#
cmaccu = cm$overall['Accuracy']
# print( paste("Total Accuracy = ", cmaccu ) )
#
cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics
cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
# print("Other metrics : ")
}
xkabledply(confusionMatrixResultDf, title="Kyphosis Classification Trees summary with varying MaxDepth")
| Depth | Accuracy | Sensitivity | Specificity | Pos.Pred.Value | Neg.Pred.Value | Precision | Recall | F1 | Prevalence | Detection.Rate | Detection.Prevalence | Balanced.Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.876 | 0.387 | 0.972 | 0.724 | 0.891 | 0.724 | 0.387 | 0.504 | 0.163 | 0.0628 | 0.0867 | 0.679 |
| 3 | 0.881 | 0.571 | 0.941 | 0.652 | 0.919 | 0.652 | 0.571 | 0.609 | 0.163 | 0.0928 | 0.1422 | 0.756 |
| 4 | 0.886 | 0.539 | 0.954 | 0.695 | 0.914 | 0.695 | 0.539 | 0.607 | 0.163 | 0.0875 | 0.1260 | 0.746 |
| 5 | 0.885 | 0.584 | 0.944 | 0.669 | 0.921 | 0.669 | 0.584 | 0.623 | 0.163 | 0.0948 | 0.1418 | 0.764 |
| 6 | 0.889 | 0.546 | 0.956 | 0.704 | 0.916 | 0.704 | 0.546 | 0.615 | 0.163 | 0.0887 | 0.1260 | 0.751 |
| 7 | 0.889 | 0.546 | 0.956 | 0.704 | 0.916 | 0.704 | 0.546 | 0.615 | 0.163 | 0.0887 | 0.1260 | 0.751 |
Maxdepth equals 6 is giving highest accuracy, precision, and recall. So we will use to perform further evaluation using maxdepth as 6.
dtree_model <- rpart(Revenue ~ . - Region, data=train, method="class", control = list(maxdepth = 6, cp=0.005) )
confusionMatrix( predict(dtree_model, test, type = "class"), reference = test[, "Revenue"], positive = "2")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1975 182
## 2 92 219
##
## Accuracy : 0.889
## 95% CI : (0.876, 0.901)
## No Information Rate : 0.838
## P-Value [Acc > NIR] : 2.25e-13
##
## Kappa : 0.552
##
## Mcnemar's Test P-Value : 7.59e-08
##
## Sensitivity : 0.5461
## Specificity : 0.9555
## Pos Pred Value : 0.7042
## Neg Pred Value : 0.9156
## Prevalence : 0.1625
## Detection Rate : 0.0887
## Detection Prevalence : 0.1260
## Balanced Accuracy : 0.7508
##
## 'Positive' Class : 2
##
We are getting 88.9 as accuracy for above tree model. But we can observe sensitivity or recall is low because it depends on False Negative (FN) that is 182.
plotcp(dtree_model)
Above graph of relative error vs. depth/cp shows that after maxdepth
equals 6 there is saturation in error.
loadPkg("rattle") # For fancyRpartPlot (Trees) Answer "no" on installing from binary source
fancyRpartPlot(dtree_model)
library(pROC)
prob = predict(dtree_model, test, type = "class")
# df$prob=prob
h <- roc(test$Revenue ~ as.numeric(prob))
auc(h) # area-under-curve prefer 0.8 or higher.
## Area under the curve: 0.751
plot(h)
Above ROC-AUC analysis shows that imbalance in dataset is not giving
good AUC score (<0.8). This is because specificity is much higher
than sensitivity.
Now we will treat our unbalance dataset. True values are very less as compared to False. So we will oversample True observation of Revenue equal to False value for each corresponding variable in dataset.
train_os = data.frame(train)
train_new <- ovun.sample(Revenue ~ ., data = train_os, method = "over",N = 16710)$data
table(train_new$Revenue)
##
## 1 2
## 8355 8355
Above table shows that True and False observations are equal.
confusionMatrixResultDf = data.frame( Depth=numeric(0), Accuracy= numeric(0), Sensitivity=numeric(0), Specificity=numeric(0), Pos.Pred.Value=numeric(0), Neg.Pred.Value=numeric(0), Precision=numeric(0), Recall=numeric(0), F1=numeric(0), Prevalence=numeric(0), Detection.Rate=numeric(0), Detection.Prevalence=numeric(0), Balanced.Accuracy=numeric(0), row.names = NULL )
for (deep in 2:10) {
tree_fit <- rpart(Revenue ~ . - Region, data=train_new, method="class", control = list(maxdepth = deep, cp=0.001) )
#
cm = confusionMatrix( predict(tree_fit, test, type = "class"), reference = test[, "Revenue"], positive = "2") # from caret library
#
cmaccu = cm$overall['Accuracy']
# print( paste("Total Accuracy = ", cmaccu ) )
#
cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics
cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
# print("Other metrics : ")
}
xkabledply(confusionMatrixResultDf, title="Kyphosis Classification Trees summary with varying MaxDepth")
| Depth | Accuracy | Sensitivity | Specificity | Pos.Pred.Value | Neg.Pred.Value | Precision | Recall | F1 | Prevalence | Detection.Rate | Detection.Prevalence | Balanced.Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.857 | 0.748 | 0.879 | 0.544 | 0.947 | 0.544 | 0.748 | 0.630 | 0.163 | 0.122 | 0.223 | 0.813 |
| 3 | 0.838 | 0.785 | 0.848 | 0.501 | 0.953 | 0.501 | 0.785 | 0.612 | 0.163 | 0.128 | 0.255 | 0.817 |
| 4 | 0.841 | 0.785 | 0.852 | 0.507 | 0.953 | 0.507 | 0.785 | 0.616 | 0.163 | 0.128 | 0.252 | 0.819 |
| 5 | 0.842 | 0.798 | 0.851 | 0.510 | 0.956 | 0.510 | 0.798 | 0.622 | 0.163 | 0.130 | 0.255 | 0.825 |
| 6 | 0.842 | 0.803 | 0.849 | 0.508 | 0.957 | 0.508 | 0.803 | 0.622 | 0.163 | 0.131 | 0.257 | 0.826 |
| 7 | 0.836 | 0.805 | 0.842 | 0.498 | 0.957 | 0.498 | 0.805 | 0.615 | 0.163 | 0.131 | 0.263 | 0.824 |
| 8 | 0.841 | 0.788 | 0.851 | 0.506 | 0.954 | 0.506 | 0.788 | 0.617 | 0.163 | 0.128 | 0.253 | 0.820 |
| 9 | 0.842 | 0.788 | 0.853 | 0.510 | 0.954 | 0.510 | 0.788 | 0.619 | 0.163 | 0.128 | 0.251 | 0.821 |
| 10 | 0.842 | 0.788 | 0.852 | 0.509 | 0.954 | 0.509 | 0.788 | 0.618 | 0.163 | 0.128 | 0.252 | 0.820 |
dtree_model <- rpart(Revenue ~ . - Region, data=train_new, method="class", control = list(maxdepth = 2, cp=0.001) )
confusionMatrix(predict(dtree_model, test, type = "class"), reference = test$Revenue, positive = "2")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 1816 101
## 2 251 300
##
## Accuracy : 0.857
## 95% CI : (0.843, 0.871)
## No Information Rate : 0.838
## P-Value [Acc > NIR] : 0.00362
##
## Kappa : 0.545
##
## Mcnemar's Test P-Value : 1.99e-15
##
## Sensitivity : 0.748
## Specificity : 0.879
## Pos Pred Value : 0.544
## Neg Pred Value : 0.947
## Prevalence : 0.162
## Detection Rate : 0.122
## Detection Prevalence : 0.223
## Balanced Accuracy : 0.813
##
## 'Positive' Class : 2
##
After oversampling our accuracy is 85.7, which is lower than from previous model with unbalanced dataset.
prob = predict(dtree_model, test, type = "class")
# df$prob=prob
h <- roc(test$Revenue ~ as.numeric(prob))
auc(h) # area-under-curve prefer 0.8 or higher.
## Area under the curve: 0.813
plot(h)
Oversampling causes sensitivity to increase because FN is increased. AUC
score is 81.3 now (>0.8).
fancyRpartPlot(dtree_model)
unloadPkg("Caret")
#First, load the file
online = data.frame(read.csv("online_shoppers_intention.csv"))
# We will use this dataframe later for model manipulation
original_online = data.frame(read.csv("online_shoppers_intention.csv"))
total_observations = nrow(online)
total_observations
head(online)
# Handling categorical variables prior to plotting
online$Month = as.factor(online$Month)
online$Revenue = as.factor(online$Revenue)
online$Weekend = as.factor(online$Weekend)
online$VisitorType = as.factor(online$VisitorType)
monthFrequency = ggplot(online, aes(x=Month, color=Month, fill=Month)) + geom_bar() + labs(title="Count of Months",
x ="Month", y = "Count")
monthFrequency
userCount = ggplot(online, aes(x=VisitorType, color=VisitorType, fill=VisitorType)) + geom_bar() + labs(title="Visitor Type Count",
x ="Visitor Types", y = "Count")
userCount
revCount = ggplot(online, aes(x=Revenue, color=Revenue, fill=Revenue)) + geom_bar() + labs(title="Revenue Count",
x ="Revenue", y = "Count")
revCount
revTrue = subset(online, online$Revenue == TRUE)
revFalse = subset(online, online$Revenue == FALSE)
pieNum <- c(nrow(revTrue), nrow(revFalse))
piepercent<- round(100*pieNum/sum(pieNum), 1)
pie(pieNum, labels = piepercent, main = "Revenue Percentages",col = rainbow(length(pieNum)))
legend("topright", c("True","False"), cex = 0.8, fill = rainbow(length(pieNum)))
# Special Day data
specialDay = ggplot(online, aes(SpecialDay)) + geom_density(color="darkblue", fill="lightblue") + labs(title="Density Chart of Special Day",x ="Special Day", y = "Density")
specialDay
# scatterSpecialDay = ggplot(online, aes(Revenue, SpecialDay)) + geom_violin(), this plot had no significance
admin = ggplot(online, aes(Administrative, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Administrative", x ="Administrative", y = "Count")
admin
info = ggplot(online, aes(Informational, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Informational", x ="Informational", y = "Count")
info
prod = ggplot(online, aes(ProductRelated, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Product Related", x ="Product Related", y = "Count")
prod
#
# Duration of the above Pages
#
adminDur = ggplot(online, aes(Administrative_Duration, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Administrative Duration", x ="Administrative Duration", y = "Count")
adminDur
infoDur = ggplot(online, aes(Informational_Duration, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Informational Duration", x ="Informational Duration", y = "Count")
infoDur
prodDur = ggplot(online, aes(ProductRelated_Duration, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Product Related Duration", x ="Product Related Duration", y = "Count")
prodDur
# Bounce Rate is measured by total one page visits divided by total visits
bounceBox = ggplot(online, aes(BounceRates)) + geom_boxplot(color='blue') + labs(title="Boxplot of Bounce Rate", x ="Bounce Rate")
bounceBox
# Exit Rate is measured by total exits from page divided by total visits to page
exitBox = ggplot(online, aes(ExitRates)) + geom_boxplot(color='orange') + labs(title="Boxplot of Exit Rate", x ="Exit Rate")
exitBox
# Page Value is the average value for a page that a user visited before landing on the goal page or completing an Ecommerce transaction
pageBox_rev = ggplot(revTrue, aes(PageValues)) + geom_boxplot(color='green') + labs(title="Boxplot of Page Values with Revenue", x ="Page Values")
pageBox_rev
pageBox_no_rev = ggplot(revFalse, aes(PageValues)) + geom_boxplot(color='red') + labs(title="Boxplot of Page Values of No Revenue Rows", x ="Page Values")
pageBox_no_rev
The average Bounce Rate for this site is 2.219%. The average Exit Rate
for this site is 4.307%. Page Values that had revenue had an average of
27.265. Page Values that had did not have revenue had an average of
1.976. For our dataset it would be better to see individual page exit
and bounce rates. The column variables Exit and Bounce Rates are the
average of these metrics over each page on this site.
# Corrplot after changing categorical to numerical
online_new = subset(online, select = -c(VisitorType, Month))
online_new$Weekend = as.numeric(online_new$Weekend)
online_new$Revenue = as.numeric(online_new$Revenue)
corrOnline <- cor(online_new)
corrplot(corrOnline, method="number",number.cex=0.5,tl.cex=.6)
# Table for Month and Revenue of that Month
monthTable = xtabs( ~ Revenue + Month, data = online)
monthTable
chisqMonth = chisq.test(monthTable)
chisqMonth
# Table for Special Day and Revenue as we get closer to that Day
# This value represents the closeness of the browsing date to special days or holidays
specialDayTable = xtabs( ~ Revenue + SpecialDay, data = online)
specialDayTable
chisqSD = chisq.test(specialDayTable)
chisqSD
library(tidyr)
# Rehandle variables for knn model
online$VisitorType = as.factor(online$VisitorType)
online$Weekend = as.integer(online$Weekend)
online$OperatingSystems = as.integer(online$OperatingSystems)
online$Browser = as.integer(online$Browser)
online$TrafficType = as.integer(online$TrafficType)
online$Region = as.integer(online$Region)
online$Month = as.integer(factor(online$Month, levels = month.abb))
online = drop_na(online) # dropping wrong month rows
# Making Visitor Type into numerical variable
library(tidyverse)
online$VisitorType <- str_replace(online$VisitorType,'New_Visitor','1')
online$VisitorType <- str_replace(online$VisitorType,'Other','2')
online$VisitorType <- str_replace(online$VisitorType,'Returning_Visitor','3')
online$VisitorType = as.numeric(online$VisitorType)
# handling the variables for the specific variables
# PageValues + Month + ExitRates + ProductRelated_Duration + TrafficType + VisitorType + ProductRelated
original_online$VisitorType = as.factor(original_online$VisitorType)
original_online$Weekend = as.integer(original_online$Weekend)
original_online$OperatingSystems = as.integer(original_online$OperatingSystems)
original_online$Browser = as.integer(original_online$Browser)
original_online$TrafficType = as.integer(original_online$TrafficType)
original_online$Region = as.integer(original_online$Region)
original_online$Month = as.integer(factor(original_online$Month, levels = month.abb))
original_online = drop_na(online)
# Making Visitor Type into numerical variable
original_online$VisitorType <- str_replace(original_online$VisitorType,'New_Visitor','1')
original_online$VisitorType <- str_replace(original_online$VisitorType,'Other','2')
original_online$VisitorType <- str_replace(original_online$VisitorType,'Returning_Visitor','3')
original_online$VisitorType = as.numeric(original_online$VisitorType)
original_online = subset(original_online, select = -c(1,2,3,4,7,10,12,13,14,17))
scaledOnline <- as.data.frame(scale(online[1:17], center = TRUE, scale = TRUE))
# Selected Variables based off EDA
scaled_Online_select <- as.data.frame(scale(original_online[1:7], center = TRUE, scale = TRUE))
set.seed(123)
online_sample <- sample(2, nrow(scaledOnline), replace=TRUE, prob=c(0.80, 0.20))
# Selected Variables Model based off EDA
set.seed(123)
online_sample_selected <- sample(2, nrow(scaled_Online_select), replace=TRUE, prob=c(0.8, 0.2))
online_training <- scaledOnline[online_sample==1, 1:17]
online_test <- scaledOnline[online_sample==2, 1:17]
# Selected Variables Model based off EDA
online_training_selected <- scaled_Online_select[online_sample_selected==1, 1:7]
online_test_selected <- scaled_Online_select[online_sample_selected==2, 1:7]
online_trainLabels <- online[online_sample==1, 18]
online_testLabels <- online[online_sample==2, 18]
# Selected Variables Model based off EDA
online_trainLabels_selected <- original_online[online_sample_selected==1, 8]
online_testLabels_selected <- original_online[online_sample_selected==2, 8]
ResultDf = data.frame( k=numeric(0), Total.Accuracy= numeric(0), row.names = NULL )
ResultDf_selected = data.frame( k=numeric(0), Total.Accuracy= numeric(0), row.names = NULL )
library(class)
library(gmodels)
for (kval in 5:25) {
online_pred <- knn(train = online_training, test = online_test, cl=online_trainLabels, k=kval)
onlinePREDCross <- CrossTable(online_testLabels, online_pred, prop.chisq = FALSE)
print( paste("k = ", kval) )
onlinePREDCross
library(caret)
cm = confusionMatrix(online_pred, reference = online_testLabels )
cmaccu = cm$overall['Accuracy']
print( paste("Total Accuracy = ", cmaccu ) )
cmt = data.frame(k=kval, Total.Accuracy = cmaccu, row.names = NULL )
ResultDf = rbind(ResultDf, cmt)
print( xkabledply( as.matrix(cm), title = paste("ConfusionMatrix for k = ",kval ) ) )
print( xkabledply(data.frame(cm$byClass), title=paste("k = ",kval)) )
}
# Selected Variable Model
for (kval in 10:30) {
online_pred_selected <- knn(train = online_training_selected, test = online_test_selected, cl=online_trainLabels_selected, k=kval)
onlinePREDCross_selected <- CrossTable(online_testLabels_selected, online_pred_selected, prop.chisq = FALSE)
print( paste("k = ", kval) )
onlinePREDCross_selected
cm_selected = confusionMatrix(online_pred_selected, reference = online_testLabels_selected )
cmaccu_selected = cm_selected$overall['Accuracy']
print( paste("Total Accuracy = ", cmaccu_selected ) )
cmt_selected = data.frame(k=kval, Total.Accuracy = cmaccu_selected, row.names = NULL )
ResultDf_selected = rbind(ResultDf_selected, cmt_selected)
print( xkabledply( as.matrix(cm_selected), title = paste("ConfusionMatrix for k = ",kval ) ) )
print( xkabledply(data.frame(cm_selected$byClass), title=paste("k = ",kval)) )
}
xkabledply(ResultDf, "Total Accuracy Summary")
| k | Total.Accuracy |
|---|---|
| 5 | 0.861 |
| 6 | 0.860 |
| 7 | 0.863 |
| 8 | 0.862 |
| 9 | 0.862 |
| 10 | 0.862 |
| 11 | 0.863 |
| 12 | 0.862 |
| 13 | 0.865 |
| 14 | 0.864 |
| 15 | 0.867 |
| 16 | 0.865 |
| 17 | 0.867 |
| 18 | 0.867 |
| 19 | 0.867 |
| 20 | 0.868 |
| 21 | 0.865 |
| 22 | 0.865 |
| 23 | 0.867 |
| 24 | 0.865 |
| 25 | 0.866 |
# Selected Variable Model
xkabledply(ResultDf_selected, "Total Accuracy Summary")
| k | Total.Accuracy |
|---|---|
| 10 | 0.887 |
| 11 | 0.886 |
| 12 | 0.887 |
| 13 | 0.885 |
| 14 | 0.884 |
| 15 | 0.886 |
| 16 | 0.883 |
| 17 | 0.886 |
| 18 | 0.887 |
| 19 | 0.887 |
| 20 | 0.887 |
| 21 | 0.887 |
| 22 | 0.888 |
| 23 | 0.885 |
| 24 | 0.885 |
| 25 | 0.887 |
| 26 | 0.888 |
| 27 | 0.886 |
| 28 | 0.887 |
| 29 | 0.887 |
| 30 | 0.884 |
kPlot = ggplot(ResultDf, aes(k, Total.Accuracy)) + geom_point(colour = "blue", size = 3) + geom_line(color='orange')
kPlot
kPlot_selected = ggplot(ResultDf_selected, aes(k, Total.Accuracy)) + geom_point(colour = "blue", size = 3) + geom_line(color='orange')
kPlot_selected
# Max Accuracy
online_pred_max <- knn(train = online_training, test = online_test, cl=online_trainLabels, k=15)
onlinePREDCross_max <- CrossTable(online_testLabels, online_pred, prop.chisq = FALSE)
# Selected Variable Model
online_pred_max_selected <- knn(train = online_training_selected, test = online_test_selected, cl=online_trainLabels_selected, k=25)
onlinePREDCross_max_selected <- CrossTable(online_testLabels_selected, online_pred_selected, prop.chisq = FALSE)
loadPkg("pROC")
# model with every variable
h <- roc(online_testLabels , as.integer(online_pred_max))
auc(h)
plot(h)
# model with specific variables
h_selected <- roc(online_testLabels_selected , as.integer(online_pred_max_selected))
auc(h_selected)
plot(h_selected)
Sadly, both models do not meet the cut off AUC metric of 0.8. We can see from the Confusion Matrix that when the models are at the optimal k value the total accuracy are in the high 80 percent levels. However, the issue we need to address is the specificity that is measured by taking the true negatives divided by the true negatives plus the false positives. For both models we built we can see that there are small specificity rates.